Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, tv, radio, and newspaper.
We can conclude that:

- in the absence of any tv, radio, or newspaper advertising, sales of the product will be non-zero;
- there is a relationship between tv expenditure and sales, and between radio expenditure and sales: the data suggest that the increase in sales due to a one-unit increase in tv or radio expenditure is statistically significant, i.e. very unlikely to arise by chance if the null hypothesis were true;
- no statistically significant relationship was found between expenditure on newspaper advertising and sales.
To simplify, each null hypothesis states that there is no relationship between sales and expenditure on tv, radio, or newspaper, i.e. that the corresponding coefficient is zero. The p-value associated with each predictor is the probability of observing an association at least as strong as the one in the data, assuming the null hypothesis is true. This probability is very low for the intercept, TV, and radio but very large for newspaper, which means we cannot reject the null hypothesis for newspaper but can reject it in favour of the alternative hypothesis for TV, radio, and the intercept.
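In equation form, with the model $\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot \text{newspaper} + \epsilon$, each p-value in Table 3.4 corresponds to a test of

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_a: \beta_j \neq 0, \qquad t = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)},$$

where a large $|t|$ (and hence a small p-value) leads us to reject $H_0$.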
Carefully explain the differences between the KNN classifier and KNN regression methods.

Both are non-parametric, instance-based approaches that assign a response to a data point based on the responses of some number of nearest neighbors.
However, the nature of the response and the method for deriving it differ. KNN regression considers the $K$ data points nearest to a query point $x_0$ and estimates

$$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i,$$

where $\mathcal{N}_0$ is the set of the $K$ nearest neighbors of $x_0$. The response is continuous.
The KNN classifier considers the $K$ data points nearest to $x_0$ and predicts the most frequently occurring (modal) class among those neighbors. More specifically, it estimates the probability that $x_0$ belongs to class $j$ as the fraction of the neighbors whose class is $j$,

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j),$$

and assigns the class with the highest estimated probability. The response is categorical or discrete.
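To make the contrast concrete, here is a minimal base-R sketch (the toy data and the function name are made up for illustration) that derives both kinds of KNN response from the same neighbor set:

```r
# Toy 1-D KNN: regression averages the neighbors' responses,
# classification takes their majority (modal) class.
knn_predict <- function(x_train, y_train, x0, k = 3) {
  nbrs <- order(abs(x_train - x0))[1:k]        # indices of the k nearest neighbors
  if (is.numeric(y_train)) {
    mean(y_train[nbrs])                        # regression: average of neighbor responses
  } else {
    names(which.max(table(y_train[nbrs])))     # classification: modal neighbor class
  }
}

x <- c(1, 2, 3, 10, 11, 12)
y_num <- c(1.0, 1.2, 0.9, 5.1, 4.9, 5.0)          # continuous response
y_cat <- factor(c("a", "a", "a", "b", "b", "b"))  # categorical response

knn_predict(x, y_num, x0 = 2.5)  # regression: mean of the 3 nearest y values
knn_predict(x, y_cat, x0 = 2.5)  # classification: majority class "a"
```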
Suppose we have a dataset with five predictors: $X_1$ = GPA, $X_2$ = IQ, $X_3$ = Level (1 for College, 0 for High School), $X_4$ = Interaction between GPA and IQ, $X_5$ = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get $\hat{\beta}_0 = 50$, $\hat{\beta}_1 = 20$, $\hat{\beta}_2 = 0.07$, $\hat{\beta}_3 = 35$, $\hat{\beta}_4 = 0.01$, $\hat{\beta}_5 = -10$.
Which answer is correct and why?

1. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.
2. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.
3. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.
4. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.
The regression equation is

$$
\begin{aligned}
\hat{Y} &= \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{GPA} + \hat{\beta}_2 \cdot \text{IQ} + \hat{\beta}_3 \cdot \text{Level} + \hat{\beta}_4 \cdot \text{GPA} \cdot \text{IQ} + \hat{\beta}_5 \cdot \text{GPA} \cdot \text{Level} \\
&= \hat{\beta}_0 + (\hat{\beta}_1 + \hat{\beta}_4 \cdot \text{IQ} + \hat{\beta}_5 \cdot \text{Level}) \cdot \text{GPA} + \hat{\beta}_2 \cdot \text{IQ} + \hat{\beta}_3 \cdot \text{Level} \\
&= 50 + (20 + 0.01 \cdot \text{IQ} - 10 \cdot \text{Level}) \cdot \text{GPA} + 0.07 \cdot \text{IQ} + 35 \cdot \text{Level}
\end{aligned}
$$

For the same values of IQ and GPA, the predicted earnings change with Level, which we can check numerically over a grid:
iq <- seq(80, 130, 0.1)
gpa <- seq(2.0, 4.0, 0.1)
iq.gpa.df <- data.frame(
iq = rep(iq, each = length(gpa)),
gpa = rep(gpa, length(iq))
)
# iq.gpa.df %>% dplyr::group_by(iq) %>% dplyr::summarize(min_gpa = min(gpa), max_gpa = max(gpa), total = n())
iq.gpa.df <- iq.gpa.df %>%
dplyr::mutate(
earnings_school = 50 + (20 + 0.01 * iq) * gpa + 0.07 * iq,   # Level = 0
earnings_college = 85 + (10 + 0.01 * iq) * gpa + 0.07 * iq   # Level = 1: 85 = 50 + 35, 10 = 20 - 10
)
iq.gpa.df %>% dplyr::filter(earnings_school > earnings_college) %>%
dplyr::summarize(min_iq = min(iq), max_iq = max(iq),
min_gpa = min(gpa), max_gpa = max(gpa),
max_diff = max(earnings_school - earnings_college))
## min_iq max_iq min_gpa max_gpa max_diff
## 1 80 130 3.5 4 5
This elementary analysis shows that if the GPA is high enough (above 3.5), high school graduates can out-earn college graduates, although not by a large amount.
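The same conclusion follows directly from the fitted equation: for fixed GPA and IQ, the difference in predicted salary between a college and a high school graduate is

$$\hat{Y}_{\text{college}} - \hat{Y}_{\text{school}} = \hat{\beta}_3 + \hat{\beta}_5 \cdot \text{GPA} = 35 - 10 \cdot \text{GPA},$$

which is negative exactly when GPA > 3.5. Answer 3 is therefore correct.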
Predict the salary of a college graduate with IQ of 110 and GPA of 4.0.
iq <- 110
gpa <- 4.0
is.college <- 1.0
salary <- 50 + 20 * gpa + 0.07 * iq + 35 * is.college + 0.01 * gpa * iq - 10 * gpa * is.college
print(paste0("Salary for such a student is $", salary * 1000))
## [1] "Salary for such a student is $137100"
True or False: Since the coefficient for GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
False. Significance is not determined by the magnitude of the coefficient; it is determined by the p-value. Even if the coefficient of the GPA/IQ interaction term is very small, in the absence of a p-value we can’t say whether there is evidence of an interaction effect.
I collect a set of data ($n = 100$ observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$.
Suppose the true relationship between $X$ and $Y$ is linear, i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
We’d expect the training RSS to be lower for the cubic regression than for the linear regression. The addition of second- and third-order polynomial terms increases the model’s flexibility and allows it to fit the training data more closely, even overfitting the noise, which decreases training RSS.
Answer (a) using test rather than training RSS.

Test RSS is expected to be higher for the cubic regression than for the linear regression, because the true underlying relationship is linear and the cubic regression has learnt to overfit noise in the training data.
Suppose the true relationship between $X$ and $Y$ is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear and cubic regressions. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

We’d still expect training RSS to be lower for the cubic regression than for the linear regression: the cubic model is strictly more flexible, so its training RSS can never be higher. The extent to which the errors differ depends on just how non-linear the data are. If the data are very non-linear, the cubic regression will have a much lower training RSS; if only slightly non-linear, its training RSS will be only slightly lower.
Answer (c) using test RSS rather than training RSS.

There is not enough information to tell. If the true relationship is close to a straight line, the less flexible linear regression may well have the lower test RSS; if it is strongly non-linear and reasonably approximated by a third-degree polynomial, the cubic regression will have the lower test RSS.
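Parts (a)–(d) are all instances of the bias-variance trade-off: the expected test error at a point $x_0$ decomposes as

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon).$$

The cubic fit always has the higher variance; whether its lower bias compensates depends on how far the true $f$ is from linear.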
Use the lm function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary function to print the results. Comment on the output.
lm.mpg <- lm(mpg ~ horsepower, data = Auto)
summary(lm.mpg)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
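The output shows a strong, statistically significant negative relationship between horsepower and mpg, with $R^2 \approx 0.61$. As a quick sketch (assuming Auto is the dataset from the ISLR package, as elsewhere in this document), the fitted model can be queried at a new value of horsepower, say 98, with both confidence and prediction intervals:

```r
library(ISLR)

# Refit the model from above (assumption: Auto comes from the ISLR package)
lm.mpg <- lm(mpg ~ horsepower, data = Auto)

# Point prediction is about 39.94 - 0.158 * 98, i.e. roughly 24.5 mpg
new_hp <- data.frame(horsepower = 98)
predict(lm.mpg, new_hp, interval = "confidence")  # interval for the mean response
predict(lm.mpg, new_hp, interval = "prediction")  # wider interval for a single car
```

The prediction interval is always wider than the confidence interval, since it accounts for the irreducible error of an individual observation as well as the uncertainty in the fitted line.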
Plot the response and the predictor. Use the abline function to display the least squares regression line.
plot(Auto$horsepower, Auto$mpg, main = 'Auto Dataset - MPG against Horsepower')
abline(lm.mpg$coefficients, col = 'red')
Use the plot function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow = c(2, 2))
plot(lm.mpg)
- Residuals show evidence of non-linearity, which suggests we may want to use polynomial regression.
- Standardized residuals show that the variance of the error terms is not constant.
- Some data points also have a large standardized residual and leverage, putting them close to Cook’s distance; we may want to review these points.
The question involves the use of multiple linear regression on the Auto dataset.

### Part (a)

Produce a scatterplot matrix which includes all of the variables in the dataset.
plot(Auto, col = 'blue', main = "Auto Dataset - Variable pairplot")
Compute the matrix of correlations between the variables using the function cor.
round(cor(Auto[, -9]), 3)
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213
## year origin
## mpg 0.581 0.565
## cylinders -0.346 -0.569
## displacement -0.370 -0.615
## horsepower -0.416 -0.455
## weight -0.309 -0.585
## acceleration 0.290 0.213
## year 1.000 0.182
## origin 0.182 1.000
Use the lm function to perform multiple linear regression with mpg as the response and all other variables except name as the predictors. Use summary to print the results. Comment on the output.
# After considering solutions at https://rpubs.com/lmorgan95/ISLR_CH3_Solutions, might be a good idea to encode the origin feature manually.
# Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
lm.mpg.multi <- lm(mpg ~ . - name, data = Auto)
summary(lm.mpg.multi)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response?

- The F-statistic is 252.4, which is much greater than 1, so there is indeed a relationship between the predictors and the response.
- The p-value associated with the F-statistic is practically 0, so this result is statistically significant.
- This, in turn, means we can reject the null hypothesis that the coefficients of all the predictors are 0.
Which predictors appear to have a statistically significant relationship with the response?

- Not all predictors are associated with the response.
- Specifically, cylinders, horsepower, and acceleration have large p-values, suggesting no statistically significant association between these variables and the response.
What does the coefficient for the year variable suggest? The coefficient for year is
lm.mpg.multi$coeff['year']
## year
## 0.7507727
which suggests that, holding the other predictors fixed, mpg increases by about 0.75 units for every additional model year. In other words, newer cars are, on average, slightly more fuel efficient.
Use plot to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with very high leverage?
par(mfrow = c(2,2))
plot(lm.mpg)
#### Comments

- Residuals show evidence of non-linearity in the data; we may have to use polynomial regression by creating higher-order terms for some features.
- Standardized residuals also show non-linearity, although they lie between -3 and +3.
- Quite a few data points lie close to Cook’s distance, which suggests we do have points with high leverage.
- We also have data points with both high leverage and high residuals. These are extracted programmatically below.
lm.mpg.multi.high.leverage <- broom::augment(lm.mpg.multi) %>%
dplyr::select(.rownames, .fitted, .hat, .resid, .std.resid, .cooksd) %>%
dplyr::arrange(desc(.cooksd))
lm.mpg.multi.high.leverage %>% head(10)
## # A tibble: 10 x 6
## .rownames .fitted .hat .resid .std.resid .cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 18.9 0.190 -4.88 -1.63 0.0778
## 2 394 34.5 0.0492 9.54 2.94 0.0558
## 3 327 31.5 0.0287 11.9 3.63 0.0488
## 4 387 28.4 0.0333 9.57 2.92 0.0368
## 5 326 32.9 0.0196 11.4 3.44 0.0296
## 6 245 32.1 0.0195 11.0 3.34 0.0278
## 7 323 33.5 0.0137 13.1 3.95 0.0271
## 8 112 27.6 0.0242 -9.59 -2.92 0.0264
## 9 330 35.0 0.0210 9.64 2.93 0.0230
## 10 45 6.25 0.0382 6.75 2.07 0.0213
Using broom and ggplot2, data points with Cook’s distance above 4/n (a common rule-of-thumb threshold) are identified.
lm.mpg.multi.high.leverage$cooksd_cutoff <- factor(
ifelse(lm.mpg.multi.high.leverage$.cooksd >= 4 / nrow(Auto), 'True', 'False'
)
)
ggplot(lm.mpg.multi.high.leverage, aes(x = .hat, y = .std.resid, col = cooksd_cutoff)) +
geom_point() +
theme(legend.position = 'bottom') +
labs(
title = 'Auto - Residuals vs Leverage',
x = 'Leverage', y = 'Standardized Residuals',
col = 'Cooks Distance >= 4/n'
)
### Part (e)

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.mpg.multi.ixn <- lm(mpg ~ . * ., data = Auto[, -9]) # Use all pairwise interactions between predictors
summary(lm.mpg.multi.ixn)
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
The interactions acceleration:origin, acceleration:year, and displacement:year appear statistically significant at the 5% level.

Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Report your findings.
# Write a simple function to identify best predictors, as reported on
# https://rpubs.com/lmorgan95/ISLR_CH3_Solutions
best_predictor <- function(dataframe, response) {
if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
stop("Make sure that all variables are of class numeric/factor!")
}
# pre-allocate vectors
varname <- c()
vartype <- c()
R2 <- c()
R2_log <- c()
R2_quad <- c()
AIC <- c()
AIC_log <- c()
AIC_quad <- c()
y <- dataframe[ ,response]
# # # # # NUMERIC RESPONSE # # # # #
if (is.numeric(y)) {
# Iterate over each column of the dataframe
for (i in 1:ncol(dataframe)) {
# Extract the column
x <- dataframe[ ,i]
# Extract the column name
varname[i] <- names(dataframe)[i]
# Define the class of the column
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
# Skip the response column itself (identical(y, x) is TRUE only for the response)
if (!identical(y, x)) {
# R-squared statistic for the linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
# R-squared statistic for the log-transformation: y ~ log(x)
if (is.numeric(x)) {
# if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
if (min(x) <= 0) {
R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
} else {
R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
}
} else {
# Don't need to do anything for categorical variables
R2_log[i] <- NA
}
# R-squared statistic for the quadratic transformation: y ~ x + x^2
if (is.numeric(x)) {
R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
} else {
R2_quad[i] <- NA
}
} else {
# If the target and the predictor columns are identical, then populate with NAs
R2[i] <- NA
R2_log[i] <- NA
R2_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
R2 = round(R2, 3),
R2_log = round(R2_log, 3),
R2_quad = round(R2_quad, 3)) %>%
mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
arrange(desc(max_R2))
# # # # # CATEGORICAL RESPONSE # # # # #
} else {
for (i in 1:ncol(dataframe)) {
# As before, iterate over all columns in the dataframe and extract each column one-by-one
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
# Populate the variable class as before
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
# For all columns except the response
if (!identical(y, x)) {
# Get the Akaike Information Criterion for the linear model: y ~ x
# AIC is derived using a generalized linear model
AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic
# Get the Akaike Information Criterion for the log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
} else {
AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
}
} else {
AIC_log[i] <- NA
}
# Get the Akaike Information Criterion for the quadratic transformation: y ~ x + I(x^2)
if (is.numeric(x)) {
AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
} else {
AIC_quad[i] <- NA
}
} else {
AIC[i] <- NA
AIC_log[i] <- NA
AIC_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
AIC = round(AIC, 3),
AIC_log = round(AIC_log, 3),
AIC_quad = round(AIC_quad, 3)) %>%
mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
arrange(min_AIC)
}
}
Running the best_predictor function over the entire dataset yields these results.
best_predictor(Auto[, -9], "mpg")
## [1] "Response variable: mpg"
## varname vartype R2 R2_log R2_quad max_R2
## 1 weight numeric 0.693 0.713 0.715 0.715
## 2 displacement numeric 0.648 0.686 0.689 0.689
## 3 horsepower numeric 0.606 0.668 0.688 0.688
## 4 cylinders numeric 0.605 0.603 0.607 0.607
## 5 year numeric 0.337 0.332 0.368 0.368
## 6 origin numeric 0.319 0.330 0.332 0.332
## 7 acceleration numeric 0.179 0.190 0.194 0.194
## 8 mpg numeric NA NA NA NA
Visualizing the feature importances of the raw features along with their log-transformed and quadratic versions.
# Get the R2 statistic for raw feature, as well as log transformed and quadratic transformed versions
mpg_predictors <- best_predictor(Auto[, -9], "mpg") %>%
dplyr::select(-c(vartype, max_R2)) %>%
tidyr::gather(key = "key", value = "R2", -varname) %>%
dplyr::filter(!is.na(R2))
## [1] "Response variable: mpg"
# Map each feature to a level based on their max R2
mpg_predictors_order <- best_predictor(Auto[, -9], "mpg") %>%
dplyr::select(varname, max_R2) %>%
dplyr::filter(!is.na(max_R2)) %>%
dplyr::arrange(desc(max_R2)) %>%
dplyr::pull(varname)
## [1] "Response variable: mpg"
# Make the varname in the features df a factor
mpg_predictors[['varname']] <- factor(mpg_predictors$varname, ordered = T, levels = mpg_predictors_order)
ggplot(mpg_predictors, aes(x = R2, y = varname, col = key, group = varname)) +
geom_line(col = 'grey15') +
geom_point(size = 2) +
theme_light() +
theme(legend.position = 'bottom') +
labs(title = "Best Predictors (& Transformations) for mpg",
col = "Predictor Transformation",
y = "Predictor")
Based on these $R^2$ values, we may want to use quadratic transformations of displacement, weight, year, and horsepower.
lm.mpg.multi.quad <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)
summary(lm.mpg.multi.quad)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) +
## I(horsepower^2) + I(year^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8607 -1.4222 0.0293 1.3596 11.7816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.121e+02 6.969e+01 5.912 7.51e-09 ***
## cylinders 5.892e-01 3.154e-01 1.868 0.062514 .
## displacement -4.415e-02 1.929e-02 -2.289 0.022606 *
## horsepower -1.861e-01 3.928e-02 -4.737 3.06e-06 ***
## weight -9.982e-03 2.485e-03 -4.016 7.13e-05 ***
## acceleration -1.837e-01 9.631e-02 -1.907 0.057279 .
## year -1.003e+01 1.838e+00 -5.455 8.86e-08 ***
## origin 5.900e-01 2.558e-01 2.307 0.021594 *
## I(weight^2) 1.052e-06 3.344e-07 3.146 0.001784 **
## I(displacement^2) 7.243e-05 3.324e-05 2.179 0.029954 *
## I(horsepower^2) 4.604e-04 1.331e-04 3.458 0.000605 ***
## I(year^2) 7.090e-02 1.207e-02 5.875 9.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.782 on 380 degrees of freedom
## Multiple R-squared: 0.8765, Adjusted R-squared: 0.873
## F-statistic: 245.3 on 11 and 380 DF, p-value: < 2.2e-16
This improves the adjusted $R^2$ substantially, from about 0.82 to 0.873. However, there is still evidence of non-linearity in the residuals, i.e. a fan shape in the standardized residuals and deviation from the diagonal in the quantile-quantile plot.
par(mfrow = c(2, 2))
plot(lm.mpg.multi.quad)
Trying to resolve this with a log transform of the response.
lm.mpg.multi.log <- lm(log(mpg) ~ . - name, data = Auto)
summary(lm.mpg.multi.log)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm.mpg.multi.log)
The final model makes the following changes:

- log-transform the response;
- use quadratic terms for features that are related non-linearly to log(mpg);
- use the most significant interaction terms;
- derive a brand variable from the name of the vehicle.
Auto_2 <- Auto %>% mutate(mpg = log(mpg))
# Extract the first item from each list element
Auto_2$brand <- sapply(strsplit(as.character(Auto_2$name), split = " "), function (x){x[1]})
# Fixing typos
Auto_2$brand <- factor(ifelse(Auto_2$brand %in% c("vokswagen", "vw"), "volkswagen",
ifelse(Auto_2$brand == "toyouta", "toyota",
ifelse(Auto_2$brand %in% c("chevroelt", "chevy"), "chevrolet",
ifelse(Auto_2$brand == "maxda", "mazda",
Auto_2$brand)))))
# Collapse into 10 categories
Auto_2$brand <- forcats::fct_lump(Auto_2$brand, n = 9, other_level = "uncommon")
# Don't need the name feature if we're using brand
Auto_2$name <- NULL
# Fit the model with the best interactions and transformations
lm.mpg.final <- lm(mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + acceleration:origin, data = Auto_2)
summary(lm.mpg.final)
##
## Call:
## lm(formula = mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year +
## acceleration:origin, data = Auto_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37893 -0.05907 -0.00022 0.05997 0.34458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.504e+01 2.651e+00 5.671 2.86e-08 ***
## cylinders -6.190e-03 1.109e-02 -0.558 0.577147
## displacement -2.793e-04 2.723e-04 -1.026 0.305691
## horsepower -7.975e-03 1.343e-03 -5.937 6.69e-09 ***
## weight -1.392e-04 2.537e-05 -5.487 7.60e-08 ***
## acceleration -1.709e-01 4.509e-02 -3.790 0.000176 ***
## year -2.726e-01 7.080e-02 -3.851 0.000139 ***
## origin -1.898e-01 5.847e-02 -3.246 0.001276 **
## brandbuick 6.736e-02 3.375e-02 1.996 0.046691 *
## brandchevrolet 3.399e-02 2.585e-02 1.315 0.189367
## branddatsun 1.802e-01 3.959e-02 4.551 7.23e-06 ***
## branddodge 5.784e-02 2.957e-02 1.956 0.051239 .
## brandford -5.201e-03 2.592e-02 -0.201 0.841093
## brandplymouth 6.287e-02 2.854e-02 2.203 0.028228 *
## brandtoyota 1.039e-01 3.863e-02 2.691 0.007454 **
## brandvolkswagen 7.267e-02 3.448e-02 2.108 0.035729 *
## branduncommon 8.069e-02 2.589e-02 3.117 0.001971 **
## I(horsepower^2) 1.752e-05 4.231e-06 4.141 4.29e-05 ***
## I(year^2) 1.788e-03 4.789e-04 3.733 0.000219 ***
## acceleration:year 1.828e-03 5.963e-04 3.065 0.002335 **
## acceleration:origin 1.116e-02 3.488e-03 3.201 0.001490 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1051 on 371 degrees of freedom
## Multiple R-squared: 0.9093, Adjusted R-squared: 0.9044
## F-statistic: 186 on 20 and 371 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm.mpg.final)
## Exercise 10: Carseats (Multiple Linear Regression)

This question should be answered using the Carseats data set.
Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.sales <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model.

- Intercept (13.04): the baseline level of sales. For a (hypothetical) price of zero at a non-urban store outside the US, expected sales are about 13,043 units (Sales is recorded in thousands of units).
- Price (-0.0545): holding the other predictors fixed, a one-unit increase in price is associated with a decrease of roughly 54 units in sales.
- UrbanYes (-0.0219): a store in an urban location sells, on average, about 22 fewer units. However, the p-value for this coefficient is very large, so the effect is not statistically significant.
- USYes (1.20): a store located in the US is expected to sell about 1,200 additional units.
Write out the model in equation form.

$$\widehat{\text{Sales}} = 13.043 - 0.054 \cdot \text{Price} - 0.022 \cdot \text{Urban} + 1.201 \cdot \text{US},$$

where Urban = 1 for stores in urban areas and US = 1 for stores in the US, and 0 otherwise.
For which of the predictors can you reject the null hypothesis $H_0: \beta_j = 0$?

- We can reject the null hypothesis for the Price and US predictors because their p-values are very small.
- We cannot reject the null hypothesis for the Urban predictor because there does not seem to be a statistically significant relationship between this variable and the response.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.sales.smaller <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.sales.smaller)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data?

- Model (a), lm.sales, has an adjusted R-squared of 0.2335 and an RSE of 2.472.
- Model (e), lm.sales.smaller, has an adjusted R-squared of 0.2354 and an RSE of 2.469.
Model (a) has a slightly higher RSE than (e) because RSE $= \sqrt{\text{RSS}/(n - d - 1)}$: adding Urban barely decreases the RSS, which does not offset the extra degree of freedom lost in the denominator. RSS always decreases when a variable is added, but adjusted $R^2$ and RSE need not improve.
Using the model from (e), obtain the 95% confidence intervals for the coefficients.

The 95% confidence interval for each coefficient is $\hat{\beta}_j \pm t_{0.975,\,n-3} \cdot \mathrm{SE}(\hat{\beta}_j)$, with $n - 3 = 397$ residual degrees of freedom, $n$ being the number of data points.

In R, this is computed as follows:
confint(lm.sales.smaller, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
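As a cross-check, the same intervals can be computed by hand from the coefficient table and the $t$ quantile with 397 degrees of freedom (a sketch; it assumes Carseats from the ISLR package and the lm.sales.smaller fit from above):

```r
library(ISLR)

# Refit the reduced model (assumption: Carseats comes from the ISLR package)
lm.sales.smaller <- lm(Sales ~ Price + US, data = Carseats)

est <- coef(lm.sales.smaller)
se <- summary(lm.sales.smaller)$coefficients[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(lm.sales.smaller)) # 397 residual df

# Matches confint(lm.sales.smaller, level = 0.95)
cbind(lower = est - t_crit * se, upper = est + t_crit * se)
```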
Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(lm.sales.smaller)
The initial diagnostic plot shows there are several data points beyond Cook’s distance in the Residuals vs Leverage plot. Exploring these in more detail.
# The average leverage is always (p + 1) / n; observations whose leverage greatly
# exceeds this value are considered high-leverage points.
# hatvalues() returns the leverage of each data point.
round(((2 + 1) / nrow(Carseats)), 10) == round(mean(hatvalues(lm.sales.smaller)), 10)
## [1] TRUE
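For simple linear regression the leverage of observation $i$ has a closed form, and in general the leverages always average to $(p + 1)/n$:

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}, \qquad \frac{1}{n} \sum_{i=1}^{n} h_i = \frac{p + 1}{n}.$$

Observations whose leverage greatly exceeds $(p + 1)/n$ deserve scrutiny.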
Identify any potential outliers as data points whose standardized residual is outside [-2, 2].
broom::augment(lm.sales.smaller) %>%
dplyr::select(.hat, .std.resid) %>%
ggplot(aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_hline(yintercept = -2, size = 1, color = 'red') +
geom_hline(yintercept = 2, size = 1, color = 'red') +
geom_vline(xintercept = 3 / nrow(Carseats), size = 1, color = 'blue') +
labs(x = 'Leverage', y = 'Standardized Residual', title = 'Residuals vs Leverage')
A more quantitative approach is to use Cook’s distance.
broom::augment(lm.sales.smaller) %>%
tibble::rownames_to_column("rowid") %>%
dplyr::arrange(desc(.cooksd)) %>%
dplyr::select(Sales, Price, US, .std.resid, .hat, .cooksd)
## # A tibble: 400 x 6
## Sales Price US .std.resid .hat .cooksd
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 14.9 82 No 2.58 0.0116 0.0261
## 2 14.4 53 No 1.73 0.0237 0.0243
## 3 10.6 149 No 2.32 0.0126 0.0228
## 4 15.6 72 Yes 2.17 0.0129 0.0205
## 5 0.37 191 Yes -1.42 0.0286 0.0198
## 6 16.3 92 Yes 2.87 0.00664 0.0183
## 7 3.47 81 No -2.10 0.0119 0.0177
## 8 0.53 159 Yes -2.05 0.0119 0.0169
## 9 13.6 89 No 2.18 0.00983 0.0158
## 10 3.02 90 Yes -2.56 0.00710 0.0157
## # … with 390 more rows
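Cook's distance is also available directly in base R via cooks.distance(). A minimal sketch, using the built-in mtcars data so it runs standalone, and using the common 4/n rule of thumb (one of several conventions) rather than a fixed cutoff:

```r
# Sketch: flagging influential observations with Cook's distance
# Rule of thumb (one of several): flag points with D_i > 4 / n
fit <- lm(mpg ~ wt + hp, data = mtcars)
d <- cooks.distance(fit)
threshold <- 4 / nrow(mtcars)
sort(d[d > threshold], decreasing = TRUE)
```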
In this problem, we will investigate the t-statistic for the null hypothesis H0: beta = 0 in a simple linear regression model without an intercept. To begin, we generate a predictor x and a response y as follows.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
Predict y using x without any intercept. Report the coefficient estimate beta-hat, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H0: beta = 0. Comment on these results.
lm.no.intercept.01 <- lm(y ~ x + 0)
summary(lm.no.intercept.01)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
These statistics suggest that there is strong evidence of a relationship between the predictor and the response. This, in turn, means that the least squares line is of the form y-hat = 1.9939x.
Perform a simple linear regression of x onto y without an intercept. Report the same information as in the previous question.
lm.no.intercept.02 <- lm(x ~ y + 0)
summary(lm.no.intercept.02)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
This regression analysis also provides significant evidence against the null hypothesis: there is strong evidence of a relationship between x and y.
What is the relationship between the results obtained in (a) and (b)?
This link has some great analysis, which I will replicate here to improve my own understanding.
Both models have the same t-statistic associated with their coefficients, but the values of the coefficient estimates and their standard errors are different.
This is because in the first model we are regressing y onto x, and in the second model we are regressing x onto y.
In the case of the first model, the fitted relationship is y-hat = 1.9939x. Algebraically reversing this gives us x = y / 1.9939, so we would expect the second coefficient to be roughly 0.5. The regression instead gives an estimate of 0.391, which isn't close to 0.5.
This is because the noise enters the two regressions differently, and it has different effects on the least squares estimates.
When we regress y onto x, RSS = sum_i (y_i - beta.hat * x_i)^2. Minimizing this vertical distance yields a slightly flatter line.
data <- data.frame(x, y)
ggplot(data, aes(x, y)) +
geom_point(color = 'red') +
geom_segment(aes(x = x, y = y, xend = x, yend = lm.no.intercept.01$fitted.values)) +
geom_abline(intercept = 0, slope = coef(lm.no.intercept.01), size = 1) +
labs(title = 'LM1 - Predicting `y` using `x`')
In the case of the second model, we are minimising the horizontal distances, RSS = sum_i (x_i - beta.hat * y_i)^2. This gives us a line with a steeper slope when drawn in the same x-y plane.
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_segment(aes(x = x, y = y,
xend = lm.no.intercept.02$fitted.values, yend = y)) +
geom_abline(intercept = 0, slope = 1 / coef(lm.no.intercept.02), size = 1) +
labs(title = "LM2: predicting x using y")
As the noise in the dataset decreases, the two lines converge to each other.
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm.no.intercept.01), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm.no.intercept.02), size = 1, col = "mediumseagreen") +
labs(title = "LM1 vs LM2")
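This convergence can be checked numerically by regenerating the data with much smaller noise (a sketch; sd = 0.01 is an arbitrary small choice):

```r
# Sketch: with near-zero noise, the two no-intercept fits describe the same line
set.seed(1)
x.low <- rnorm(100)
y.low <- 2 * x.low + rnorm(100, sd = 0.01)
slope.yx <- unname(coef(lm(y.low ~ x.low + 0)))   # slope of y ~ x
slope.xy <- unname(coef(lm(x.low ~ y.low + 0)))   # slope of x ~ y
# Drawn in the same x-y plane, the lines have slopes slope.yx and 1 / slope.xy
c(slope.yx, 1 / slope.xy)   # both very close to 2
```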
Recall that the coefficient estimate for the linear regression of Y onto X without an intercept is given by equation 3.38: beta.hat = sum(x_i * y_i) / sum(x_i^2). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
Regressing X onto Y gives sum(x_i * y_i) / sum(y_i^2). The numerators are identical, so the coefficients of regressing Y onto X and X onto Y will be equal exactly when the denominators agree - that is, when the sums of squares of the two variables are the same: sum(x_i^2) = sum(y_i^2).
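Equation 3.38 can be verified numerically against lm(); a quick sketch with simulated data:

```r
# Sketch: the no-intercept least squares slope equals sum(x * y) / sum(x^2)
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
beta.yx <- sum(x * y) / sum(x ^ 2)   # slope of y ~ x + 0
beta.xy <- sum(x * y) / sum(y ^ 2)   # slope of x ~ y + 0
all.equal(beta.yx, unname(coef(lm(y ~ x + 0))))
all.equal(beta.xy, unname(coef(lm(x ~ y + 0))))
```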
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate of Y onto X.
# This was actually done in the previous question, but we'll do it again.
set.seed(2)
x <- rnorm(100)
y <- 5 * x + rnorm(100, sd = 2)
data <- data.frame(x, y)
# Fit regressors
lm.y.onto.x <- lm(y ~ x + 0)
lm.x.onto.y <- lm(x ~ y + 0)
# Confirm sum of squares is different
sum.squares.x <- sum(x ^ 2)
sum.squares.y <- sum(y ^ 2)
cat(paste0("Sum of Squares of x: \t", round(sum.squares.x, 3), " | Sum of Squares of y:\t", round(sum.squares.y, 3)))
## Sum of Squares of x: 133.352 | Sum of Squares of y: 3591.294
# Get the coefficients for the slopes
slope.coef.y.onto.x <- coef(lm.y.onto.x)
slope.coef.x.onto.y <- coef(lm.x.onto.y)
cat(paste0("\nSlope of y ~ x:\t", round(slope.coef.y.onto.x, 3), " | Slope of x ~ y:\t", round(slope.coef.x.onto.y, 3)))
##
## Slope of y ~ x: 4.905 | Slope of x ~ y: 0.182
# Can also plot the lines
ggplot(data, aes(x , y)) +
geom_point(color = 'red') +
geom_abline(intercept = 0, slope = coef(lm.y.onto.x), size = 1, col = 'deepskyblue3') +
geom_abline(intercept = 0, slope = 1 / coef(lm.x.onto.y), size = 1, col = 'mediumseagreen') +
labs(title = 'Regression Lines (y ~ x (blue) and x ~ y (green)))')
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
# When predictor and response variables are equal, this will always be the case
set.seed(3)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)
lm.y.onto.x <- lm(y ~ x + 0)
lm.x.onto.y <- lm(x ~ y + 0)
# Confirm sum of squares is the same
sum.squares.x <- sum(x ^ 2)
sum.squares.y <- sum(y ^ 2)
cat(paste0("Sum of Squares of x: \t", round(sum.squares.x, 3), " | Sum of Squares of y:\t", round(sum.squares.y, 3)))
## Sum of Squares of x: 72.566 | Sum of Squares of y: 72.566
# Get the coefficients for the slopes
slope.coef.y.onto.x <- coef(lm.y.onto.x)
slope.coef.x.onto.y <- coef(lm.x.onto.y)
cat(paste0("\nSlope of y ~ x:\t", round(slope.coef.y.onto.x, 3), " | Slope of x ~ y:\t", round(slope.coef.x.onto.y, 3)))
##
## Slope of y ~ x: 1 | Slope of x ~ y: 1
# Can also plot the lines
ggplot(data, aes(x , y)) +
geom_point(color = 'red') +
geom_abline(intercept = 0, slope = coef(lm.y.onto.x), size = 1, col = 'deepskyblue3') +
geom_abline(intercept = 0, slope = 1 / coef(lm.x.onto.y), size = 1, col = 'mediumseagreen') +
labs(title = 'Regression Lines (y ~ x (blue) and x ~ y (green))')
In this exercise, you will create some simulated data and will fit simple linear regression models to it.
Using the rnorm function, create a vector x containing 100 observations drawn from a N(0, 1) distribution.
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
Using the rnorm function, create a vector eps containing 100 observations drawn from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25.
eps <- rnorm(100, mean = 0, sd = sqrt(0.25)) # variance = sd ^ 2
Using x and eps, generate a vector y according to the model Y = -1 + 0.5X + eps (equation 3.39). What is the length of the vector y? What are the values of beta_0 and beta_1 in this linear model?
# Creating target
y <- -1 + 0.5 * x + eps
# Confirming that 100 elements created
length(y) == length(x) & length(y) == length(eps)
## [1] TRUE
# True parameters are B_0 = -1 and B_1 = 0.5
Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
data <- data.frame(x = x, y = y)
ggplot(data, aes(x = x, y = y)) + geom_point()
As x increases, y also increases, with visible scatter around the linear trend due to the noise term.
Fit a least squares linear model to predict y using x. Comment on the model obtained. How do the estimates of beta_0 and beta_1 compare to the true beta_0 and beta_1?
lm.simulated <- lm(y ~ x, data = data)
summary(lm.simulated)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
The slope and intercept estimates are very close to the true values. The deviations are primarily due to the noise in the data. The p-values associated with both coefficients are practically 0, so we have no reason to believe these associations are due to chance (as expected).
However, the R^2 is low and the RSE is relatively high - this is due to the noise in the data.
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line in the plot with a different color.
g1 <- ggplot(data, aes(x = x, y = y)) +
geom_point()
g1 +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm.simulated)[1], slope = coef(lm.simulated)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.5")
Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit?
lm.simulated.quad <- lm(y ~ x + I(x^2), data = data)
summary(lm.simulated.quad)
##
## Call:
## lm(formula = y ~ x + I(x^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98252 -0.31270 -0.06441 0.29014 1.13500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
## x 0.50858 0.05399 9.420 2.4e-15 ***
## I(x^2) -0.05946 0.04238 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
## F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
The p-value for the I(x^2) term is 0.164, so there is no evidence that the quadratic term improves the model fit. This matches the data-generating model, which is linear in x.
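As a cross-check, the same conclusion can be reached with a nested-model F-test; this sketch regenerates the same simulated data so it runs standalone:

```r
# Sketch: nested F-test for the quadratic term (equivalent to the t-test on I(x^2))
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = sqrt(0.25))
y <- -1 + 0.5 * x + eps
fit.lin  <- lm(y ~ x)
fit.quad <- lm(y ~ x + I(x^2))
anova(fit.lin, fit.quad)   # large p-value: no evidence the quadratic term helps
```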
Repeat parts (a) to (f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same.
# Replicating code from above but with a smaller variance - standard linear model
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.1) # Now 0.1 instead of 0.5
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)
lm.simulated.low.var <- lm(y ~ x)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm.simulated.low.var)[1],
slope = coef(lm.simulated.low.var)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.1")
summary(lm.simulated.low.var)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18768 -0.06138 -0.01395 0.05394 0.23462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.003769 0.009699 -103.5 <2e-16 ***
## x 0.499894 0.010773 46.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared: 0.9565, Adjusted R-squared: 0.956
## F-statistic: 2153 on 1 and 98 DF, p-value: < 2.2e-16
Repeat parts (a) to (f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same.
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.9) # Now 0.9 instead of 0.5
y <- -1 + 0.5*x + eps
data <- data.frame(x, y)
lm.simulated.high.var <- lm(y ~ x)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_abline(aes(intercept = -1, slope = 0.5, col = "Population")) +
geom_abline(aes(intercept = coef(lm.simulated.high.var)[1],
slope = coef(lm.simulated.high.var)[2], col = "Least Squares")) +
scale_colour_manual(name = "Regression Line:", values = c("red", "blue")) +
theme(legend.position = "bottom") +
labs(title = "sd(eps) = 0.9")
summary(lm.simulated.high.var)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6892 -0.5524 -0.1255 0.4855 2.1116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03392 0.08729 -11.845 < 2e-16 ***
## x 0.49905 0.09695 5.147 1.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8665 on 98 degrees of freedom
## Multiple R-squared: 0.2128, Adjusted R-squared: 0.2048
## F-statistic: 26.49 on 1 and 98 DF, p-value: 1.362e-06
What are the confidence intervals for beta_0 and beta_1, based on the original dataset, the dataset with less noise, and the dataset with more noise?
# For the original model
confint(lm.simulated)
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
# For the model with low variance
confint(lm.simulated.low.var)
## 2.5 % 97.5 %
## (Intercept) -1.0230161 -0.9845224
## x 0.4785159 0.5212720
# For the model with high variance
confint(lm.simulated.high.var)
## 2.5 % 97.5 %
## (Intercept) -1.2071447 -0.8607020
## x 0.3066429 0.6914484
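Comparing the widths of the slope intervals (computed from the confint() output printed above) makes the effect of the noise level explicit:

```r
# Widths of the 95% CI for the slope, taken from the printed intervals above
width.low.var  <- 0.5212720 - 0.4785159   # sd(eps) = 0.1
width.original <- 0.6063602 - 0.3925794   # sd(eps) = 0.5
width.high.var <- 0.6914484 - 0.3066429   # sd(eps) = 0.9
c(low = width.low.var, original = width.original, high = width.high.var)
# All three intervals contain the true slope of 0.5; the interval widens as
# the noise in the data increases
```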
This problem focuses on the collinearity problem.
### Part (a) Write out the form of the linear model. What are the regression coefficients?
set.seed(1)
x1 <- runif(100) # First predictor - random numbers
x2 <- 0.5 * x1 + rnorm(100) / 10 # Second predictor - correlated
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100) # Target is combination of the two
The equation is of the form y = beta_0 + beta_1 * x1 + beta_2 * x2 + eps.
The coefficients are beta_0 = 2, beta_1 = 2, beta_2 = 0.3.
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
corr.value <- cor(x1, x2)
plot(x = x1, y = x2, main = paste0("Scatterplot - x1 vs x2 | Correlation = ", round(corr.value, 2)))
The correlation between x1 and x2 is 0.84.
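The degree of collinearity can also be summarized with a variance inflation factor. Rather than pulling in car::vif, a hand-rolled sketch using VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other:

```r
# Sketch: VIF computed by hand from the R^2 of x2 regressed on x1
# (regenerates the same data as above so it runs standalone)
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
r2 <- summary(lm(x2 ~ x1))$r.squared
vif <- 1 / (1 - r2)
vif   # in the 3-4 range here; common rules of thumb flag values above 5-10
```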
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are the estimated coefficients? How do they relate to the true coefficients? Can you reject the null hypothesis H0: beta_1 = 0? How about the null hypothesis H0: beta_2 = 0?
lm.corr.fit <- lm(y ~ x1 + x2)
summary(lm.corr.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates (2.13, 1.44, and 1.01) differ noticeably from the true coefficients (2, 2, and 0.3) because of the collinearity between x1 and x2. We can (just) reject H0: beta_1 = 0 at the 5% level (p = 0.0487), but we cannot reject H0: beta_2 = 0 (p = 0.3754).
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: beta_1 = 0?
lm.corr.x1 <- lm(y ~ x1)
summary(lm.corr.x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Yes - the p-value for x1 is practically 0, so we can reject H0: beta_1 = 0.
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: beta_1 = 0?
lm.corr.x2 <- lm(y ~ x2)
summary(lm.corr.x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Yes - x2 on its own is highly significant (p = 1.37e-05), so we can reject the null hypothesis.
Do the results obtained in (c) - (e) contradict each other? Explain your answer.
lm.corr.fit suggests only x1 is a statistically significant predictor of y, while lm.corr.x2 suggests that x2 on its own is also statistically significant. These results do not contradict each other: x1 and x2 are highly correlated, so in the joint model it is difficult to separate their individual effects on y, which inflates the standard errors and masks the significance of x2.
Now suppose we obtain one additional observation, which was unfortunately mismeasured, as follows.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
Refit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier, a high-leverage point, or both? Explain your answers.
lm.corr.x1.x2 <- lm(y ~ x1 + x2)
lm.corr.x1 <- lm(y ~ x1)
lm.corr.x2 <- lm(y ~ x2)
# For easier processing, combine the data into a dataframe
data <- data.frame(y, x1, x2, new = 0)
data$new[length(data$new)] <- 1
get_summary_and_leverage <- function(lm_fit, title_str){
# First, print the coefficients summary
message(paste0("LM Summary - ", title_str))
print(summary(lm_fit))
# Then plot the leverage
plt <- broom::augment(lm_fit) %>%
cbind(new = data$new) %>%
ggplot(aes(x = .hat, y = .std.resid, col = factor(new))) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1, alpha = 0.5) +
geom_vline(xintercept = 3 / nrow(data), col = "mediumseagreen", size = 1, alpha = 0.5) +
scale_color_manual(values = c("black", "red")) +
theme_light() +
theme(legend.position = "none") +
labs(x = "Leverage",
y = "Standardized Residual",
title = paste0("Residuals vs Leverage: ", title_str))
print(plt)
}
get_summary_and_leverage(lm.corr.x1.x2, title_str = 'y ~ x1 + x2')
## LM Summary - y ~ x1 + x2
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
- In the original model, only x1 was significant while x2 was statistically insignificant.
- With the addition of the new data point, x1 has become statistically insignificant and x2 is now statistically significant.
- The coefficient of x1 is now 0.5394 and the coefficient of x2 is now 2.5146.
- The intercept has also been skewed, to 2.2267.
- The new data point has a high standardized residual as well as high leverage, which explains why the results have changed so significantly.
get_summary_and_leverage(lm.corr.x1, title_str = 'y ~ x_1')
## LM Summary - y ~ x_1
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
- The intercept and coefficient estimates have changed in magnitude, but both are still significant.
- The new data point has relatively high leverage and also has a high standardized residual, so it can be considered both an outlier and a high-leverage point.
get_summary_and_leverage(lm.corr.x2, 'y ~ x_2')
## LM Summary - y ~ x_2
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
- The intercept has changed, and the coefficient for x2 is much larger.
- x2 is still significant.
- The new data point has high leverage, but is not an outlier because its standardized residual is between -2 and +2.
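The leverage claim can be checked directly with hatvalues(). The sketch below rebuilds the same data (same seed and same draw order as above, including the appended mismeasured point) so it runs standalone:

```r
# Sketch: leverage of the mismeasured observation vs the average leverage
set.seed(1)
x1 <- c(runif(100), 0.1)
x2 <- c(0.5 * x1[1:100] + rnorm(100) / 10, 0.8)
y  <- c(2 + 2 * x1[1:100] + 0.3 * x2[1:100] + rnorm(100), 6)
fit <- lm(y ~ x1 + x2)
lev <- hatvalues(fit)
avg.lev <- (2 + 1) / length(y)   # average leverage is (p + 1) / n
c(new.point = unname(lev[101]), average = avg.lev)
# The new point's leverage is many times the average
```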
This problem involves the Boston data set. We will try to predict the per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
### Part (a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertion.
# Read in the Boston dataset
Boston <- MASS::Boston
# Convert the `chas` (Charles River bounding dummy variable) to factor
Boston$chas <- factor(Boston$chas)
# Write a simple function to iterate over all predictors and return lm.summary statistics
R2 <- c()
P_value <- c()
Slope <- c()
y <- Boston$crim
for (i in 1:ncol(Boston)) {
x <- Boston[ ,i]
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
P_value[i] <- summary(lm(y ~ x))$coefficients[2, 4]
Slope[i] <- summary(lm(y ~ x))$coefficients[2, 1]
} else {
R2[i] <- NA
P_value[i] <- NA
Slope[i] <- NA
}
}
crime_preds <- data.frame(varname = names(Boston),
R2 = round(R2, 5),
P_value = round(P_value, 10),
Slope = round(Slope, 5)) %>%
arrange(desc(R2))
crime_preds
## varname R2 P_value Slope
## 1 rad 0.39126 0.0000000000 0.61791
## 2 tax 0.33961 0.0000000000 0.02974
## 3 lstat 0.20759 0.0000000000 0.54880
## 4 nox 0.17722 0.0000000000 31.24853
## 5 indus 0.16531 0.0000000000 0.50978
## 6 medv 0.15078 0.0000000000 -0.36316
## 7 black 0.14827 0.0000000000 -0.03628
## 8 dis 0.14415 0.0000000000 -1.55090
## 9 age 0.12442 0.0000000000 0.10779
## 10 ptratio 0.08407 0.0000000000 1.15198
## 11 rm 0.04807 0.0000006347 -2.68405
## 12 zn 0.04019 0.0000055065 -0.07393
## 13 chas 0.00312 0.2094345015 -1.89278
## 14 crim NA NA NA
Every predictor except the chas variable has a statistically significant association with crim in a univariate model.
Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0: beta_j = 0?
lm.boston.all <- lm(crim ~ ., data = Boston)
summary(lm.boston.all)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas1 -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
The predictors with small p-values are rad, dis, medv, black, zn and, just above the 5% threshold, nox. For all except nox, we can reject the null hypothesis at the 5% level.
How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis and the multiple regression coefficients from (b) on the y-axis.
lm.boston.all.coefs <- summary(lm.boston.all)$coefficients %>% as.data.frame() %>%
dplyr::select(Estimate)
lm.boston.all.coefs[['varname']] <- rownames(lm.boston.all.coefs)
lm.boston.single.coefs <- crime_preds %>% dplyr::select(varname, Slope) %>%
dplyr::rename('Estimate' = 'Slope')
lm.boston.coefs.merge <- merge(
lm.boston.all.coefs, lm.boston.single.coefs,
by = c('varname'), suffixes = c('_multi', '_single')
)
lm.boston.coefs.merge %>%
ggplot(aes(x = Estimate_single, y = Estimate_multi, color = varname)) +
geom_point()
Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form Y = beta_0 + beta_1*X + beta_2*X^2 + beta_3*X^3 + eps.
P_value_x <- c()
P_value_x2 <- c()
P_value_x3 <- c()
R2 <- c()
y <- Boston$crim
for (i in 1:ncol(Boston)) {
x <- Boston[ ,i]
if (is.numeric(x)) {
model <- lm(y ~ x + I(x^2) + I(x^3))
if (!identical(y, x)) {
P_value_x[i] <- summary(model)$coefficients[2, 4]
P_value_x2[i] <- summary(model)$coefficients[3, 4]
P_value_x3[i] <- summary(model)$coefficients[4, 4]
R2[i] <- summary(model)$r.squared
}
} else {
P_value_x[i] <- NA
P_value_x2[i] <- NA
P_value_x3[i] <- NA
R2[i] <- NA
}
}
data.frame(varname = names(Boston),
R2 = round(R2, 5),
P_value_x = round(P_value_x, 10),
P_value_x2 = round(P_value_x2, 10),
P_value_x3 = round(P_value_x3, 10)) %>%
filter(!varname %in% c("crim", "chas")) %>%
arrange(desc(R2)) %>%
mutate(relationship = case_when(P_value_x3 < 0.05 ~ "Cubic",
P_value_x2 < 0.05 ~ "Quadratic",
P_value_x < 0.05 ~ "Linear",
TRUE ~ "No Relationship"))
## varname R2 P_value_x P_value_x2 P_value_x3 relationship
## 1 medv 0.42020 0.0000000000 0.0000000000 0.0000000000 Cubic
## 2 rad 0.40004 0.6234175212 0.6130098773 0.4823137740 No Relationship
## 3 tax 0.36888 0.1097075249 0.1374681578 0.2438506811 No Relationship
## 4 nox 0.29698 0.0000000000 0.0000000000 0.0000000000 Cubic
## 5 dis 0.27782 0.0000000000 0.0000000000 0.0000000109 Cubic
## 6 indus 0.25966 0.0000529706 0.0000000003 0.0000000000 Cubic
## 7 lstat 0.21793 0.3345299858 0.0645873561 0.1298905873 No Relationship
## 8 age 0.17423 0.1426608270 0.0473773275 0.0066799154 Cubic
## 9 black 0.14984 0.1385871340 0.4741750826 0.5436171817 No Relationship
## 10 ptratio 0.11378 0.0030286627 0.0041195521 0.0063005136 Cubic
## 11 rm 0.06779 0.2117564139 0.3641093853 0.5085751094 No Relationship
## 12 zn 0.05824 0.0026122963 0.0937504996 0.2295386205 Linear
For variables where the P_value_x3 term is below 0.05, there is evidence of a cubic relationship between the predictor and crim.
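One of these findings can be cross-checked with a nested F-test; a sketch for nox, which is flagged Cubic above (poly() spans the same cubic model as the raw terms):

```r
# Sketch: nested F-test comparing linear and cubic fits of crim on nox
Boston <- MASS::Boston
fit.lin <- lm(crim ~ nox, data = Boston)
fit.cub <- lm(crim ~ poly(nox, 3), data = Boston)
anova(fit.lin, fit.cub)   # very small p-value: the higher-order terms add real fit
```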
Comments
Is there a relationship between the predictor and response? Yes. The p-value associated with the predictor is very small, suggesting that, assuming there were no relationship between the predictor and response, the probability of observing such a strong association due to random chance would be very small.
Furthermore, the F-statistic of this regression is >> 1, which also confirms that there is an association between the predictor and response.
How strong is the relationship between the predictor and the response?
response.mean <- mean(Auto$mpg)
fit.rse <- summary(lm.mpg)$sigma
print(paste0("Response Mean: ", round(response.mean, 4), " | RSE = ", round(fit.rse, 4), " | Percentage Error = ", round(fit.rse / response.mean * 100, 2)))
The residual standard error is a bit high, which suggests that the relationship is moderate. The R-squared statistic is ~60%, which means 60% of the variance in the response is explained by the regression - again evidence of a moderate relationship.
Is the relationship between the predictor and response positive or negative?
- The relationship is negative.
- For every unit increase in horsepower, mpg decreases by 0.1578 units.
What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
# Confidence and prediction intervals
rbind(
data.frame(list('data' = 98, 'interval' = 'confidence',
predict(lm.mpg, data.frame(horsepower = 98), interval = 'confidence'))),
data.frame(list('data' = 98, 'interval' = 'prediction',
predict(lm.mpg, data.frame(horsepower = 98), interval = 'prediction')))
)